{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Document embeddings in BigQuery\n",
    "\n",
    "This notebook shows how to do use a pre-trained embedding as a vector representation of a natural language text column.\n",
    "Given this embedding, we can use it in machine learning models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embedding model for documents\n",
    "\n",
    "We're going to use a model that has been pretrained on Google News. Here's an example of how it works in Python. We will use it directly in BigQuery, however."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"sequential_4\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "keras_layer_3 (KerasLayer)   (None, 20)                400020    \n",
      "=================================================================\n",
      "Total params: 400,020\n",
      "Trainable params: 0\n",
      "Non-trainable params: 400,020\n",
      "_________________________________________________________________\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([[ 0.52828205, -0.814417  ,  2.7678437 , -0.70152074, -0.99541044,\n",
       "        -2.9311025 , -1.3798233 ,  0.10915907,  0.8491049 , -1.6155498 ,\n",
       "        -1.1453229 ,  1.2871503 , -1.0593784 ,  0.32060066, -3.060015  ,\n",
       "         2.4751766 ,  2.9106884 , -2.6531873 , -2.379123  , -0.58328384]],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "import tensorflow_hub as tfhub\n",
    "\n",
    "model = tf.keras.Sequential()\n",
    "model.add(tfhub.KerasLayer(\"https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1\",\n",
    "                           output_shape=[20], input_shape=[], dtype=tf.string))\n",
    "model.summary()\n",
    "model.predict([\"\"\"\n",
    "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially. At the stroke of the midnight hour, when the world sleeps, India will awake to life and freedom.\n",
    "A moment comes, which comes but rarely in history, when we step out from the old to the new -- when an age ends, and when the soul of a nation, long suppressed, finds utterance. \n",
    "\"\"\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading model into BigQuery\n",
    "\n",
    "The Swivel model above is already available in SavedModel format. But we need it on Google Cloud Storage before we can load it into BigQuery."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "assets/\n",
      "assets/tokens.txt\n",
      "saved_model.pb\n",
      "variables/\n",
      "variables/variables.data-00000-of-00001\n",
      "variables/variables.index\n",
      "Model artifacts are now at gs://ai-analytics-solutions-kfpdemo/swivel/*\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Copying file://swivel/swivel.tar.gz [Content-Type=application/x-tar]...\n",
      "Copying file://swivel/saved_model.pb [Content-Type=application/octet-stream]...\n",
      "Copying file://swivel/assets/tokens.txt [Content-Type=text/plain]...\n",
      "Copying file://swivel/variables/variables.index [Content-Type=application/octet-stream]...\n",
      "Copying file://swivel/variables/variables.data-00000-of-00001 [Content-Type=application/octet-stream]...\n",
      "/ [5/5 files][  3.2 MiB/  3.2 MiB] 100% Done                                    \n",
      "Operation completed over 5 objects/3.2 MiB.                                      \n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "BUCKET=ai-analytics-solutions-kfpdemo   # CHANGE AS NEEDED\n",
    "\n",
    "rm -rf tmp\n",
    "mkdir tmp\n",
    "FILE=swivel.tar.gz\n",
    "wget --quiet -O tmp/swivel.tar.gz  https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed\n",
    "cd tmp\n",
    "tar xvfz swivel.tar.gz\n",
    "cd ..\n",
    "mv tmp swivel\n",
    "gsutil -m cp -R swivel gs://${BUCKET}/swivel\n",
    "rm -rf swivel\n",
    "\n",
    "echo \"Model artifacts are now at gs://${BUCKET}/swivel/*\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the model into a BigQuery dataset named advdata (create it if necessary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE MODEL advdata.swivel_text_embed\n",
    "OPTIONS(model_type='tensorflow', model_path='gs://ai-analytics-solutions-kfpdemo/swivel/*')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the BigQuery web console, click on \"schema\" tab for the newly loaded model. We see that the input is called sentences and the output is called output_0:\n",
    "<img src=\"swivel_schema.png\" />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>output_0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[-0.09961678087711334, -1.1282159090042114, 2....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            output_0\n",
       "0  [-0.09961678087711334, -1.1282159090042114, 2...."
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT output_0 FROM\n",
    "ML.PREDICT(MODEL advdata.swivel_text_embed,(\n",
    "SELECT \"Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially.\" AS sentences))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create lookup table\n",
    "\n",
    "Let's create a lookup table of embeddings. We'll use the comments field of a storm reports table from NOAA.\n",
    "This is an example of the Feature Store design pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE TABLE advdata.comments_embedding AS\n",
    "SELECT\n",
    "  output_0 as comments_embedding,\n",
    "  comments\n",
    "FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(\n",
    "  SELECT comments, LOWER(comments) AS sentences\n",
    "  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`\n",
    "))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For an example of using these embeddings in text similarity or document clustering, please see the following Medium blog post: https://medium.com/@lakshmanok/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-gpu.2-1.m54",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m54"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
